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Abstract The task of monitoring for a change in the mean of a sequence of 
Bernoulli random variables has been widely studied. However most existing 
approaches make at least one of the following assumptions, which may be vi- 
olated in many real-world situations: 1) the pre-change value of the Bernoulli 
I/"") ' parameter is known in advance, 2) computational efficiency is not paramount, 

and 3) enough observations occur between change points to allow asymp- 
totic approximations to be used. We develop a novel change detection method 
based on Fisher's Exact Test which does not make any of these assumptions. 
We show that our method can be implemented in a computationally efficient 
manner, and is hence suited to sequential monitoring where new observations 
are constantly being received over time. We assess our method's performance 
empirically via using simulated data, and find that it is comparable to the 
optimal CUSUM scheme which assumes both pre- and post-change values of 
the parameter to be known. 
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' 1 Introduction 

o ■ 

' We are concerned with the task of detecting changes in the mean of an equally 

spaced, discrete-time sequence of Bernoulli(#t) random variables. In most tra- 
ditional work on this probem, the observed data is assumed to be a fixed 
length sequence of realizations x±, . . . ,x± from the independent random vari- 
ables Xi,...,X t . Here, t is a known integer and is usually interpreted as 
denoting the time at which the observation was made. Recently, much work 
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has instead focused on the case where t is not fi xed (Woodalll. 1997), and the 
observations instead constitute a data stream ([Domingos and Hultenl . 120031 ) 
of an unknown and potentially infinite length. This data stream setting arises 
when new observations are being constantly rec eived over time, for example in 
high frequency finance ( Chen and Guptal . ll997l ). or more traditional industrial 
monitoring problems where the task is to continually monitor the output of a 
manufacturing p rocess to detect an increase in the number of defective items 
( Yeh et al. . 20081) . When working with data streams it is not known in advance 
how many observations will be received, and the fixed length assumption is 
hence inappropriate. We will refer to this data stream setting as sequential 
monitoring, and the remainder of the paper focuses on this case. As a point 
of terminology, we will refer to the i th observation as being a success if X{ = 1, 
and a failure if Xi = 

We assume that the value of the Bernoulli parameter 9 t is constant but 
unknown between each change point, and that the distribution of each of the 
Xi variables can hence be written as: 



X t ~ Bernoulli^), 9 t 



9 , if t <n 
h, if n < t < t 2 

h, if t 2 < t < T 3 



where each value of Tj denotes a change point. The task is to detect any changes 
which take place, as soon after they occur as possible, and to accurately es- 
timate the location of each change point. In sequential change detection, it is 
standard to try and detect each change point in sequence. A change detector 
is run until the first change is encountered. After this has been found, the de- 
tector is restarted and monitoring for the second change point begins. In this 
way, the task is reduced to the successive detection of individual change points, 
and the problems associated with attempting to detect multiple change points 
simultaneously are avoided. Therefore for the remainder of the paper we will 
treat the stream as if it contained at most a single change point, with the un- 
derstanding that our techniques could also be deployed on streams containing 
several. 

The perf ormance of sequential change de tectors is usually measured using 
two criteria ( Basseville and Nikiforov . 19931) : the expected time between false 
positive detections (denoted ARL , for Average Run Length), and the mean 
delay until a change is detected (denoted ARL{). If r denotes the true location 
of the change point, and f is the time at which the change detector signalled 
for a change, then we can define these formally as: 



ARL = E(t\9 = 6» ), ARL X = E{f - t\9 = ${). 

When f < r, a false positive is said to have occurred. These quantities play a 
similar role to the Type I and Type II errors in traditional hy pothesis testing. It 
is considered very important (jBasseville and Nikiforovl 119931 ) for a sequential 
change detector to have a known bound on the ARLq so that change points 
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can be assessed for statistical significance, for the same reason that hypothesis 
tests are required to have a bound on the Type I error probability. 

Within the traditional statistics literature, the analysis of the Bernoulli 
change detection problem ha s mainly focused on the case where the se quence 
Xi, . . . , X t has a fixed length (|Pettitd . [l980tlHinklev and Hinkfevl . ll970l) . These 
approaches suffer from several limitations which makes them inappropriate 
tools for the sequential monitoring problem. First, they can be computation- 
ally inefficient when used in such a context, since they require all previous data 
to be stored in memory, with test statistics being recomputed from scratch 
whenever a new observation is received. Second, although these approaches do 
provide a way to bound the Type I error when testing for a change in a fixed 
length sequence, this does not easily translate into a method for bounding 
the ARLq, which is required in sequential monitoring problems. Finally, they 
often use test statistics which only asymptotically have known distributions. 
Using asymptotic theory is valid when it can be assumed that there are many 
observations between each change point. However when changes are occur- 
ing frequently, the asymptotic distribution may diverge substantially from the 
exact small-sample distribution. 

Much work on the non-fixed length problem comes from the Quality Con- 
trol literature. When the Bernoulli random variables can be naturally split up 
into groups of size n > 1 and hence treated as Binomial r andom variab l es, th e 
p-chart is the standard tool used to monitor for change ( Montgomervl 12005 ) . 
However, the p-chart is not ideal when the observations are arriving individ- 
ually and n = 1, a situation commonly referred to as continuous inspection. 

Existing change detection methods for the n = 1 case can be categorized 
into those which are transformation-based, and t hose wh i ch us e the untrans- 
formed observations. An example of the former is iNelsorj Jl994), which treats 
the time between failures as defining a sequence of Geometric random vari- 
ables, and then uses a transformation to make this approximately Gaussian. 
Standard tools for detecting a change in the parameters of a Gaussian sequence 
can then be deployed.. Several approaches have been proposed which do not 
rely on transformations. Versions of the popular CUS UM chart which can be 
deploy ed o n untransformed observ ations are proposed in Reynolds and Stoumbod 
(|l999h an dlChang and Ganl (|200lh . A method using EWMA charts is also con- 
sidered in lYeh et all (|2008lh 

A key limitation of these existing approaches is that they assume the pre- 
change value of the Bernoulli parameter, 0q, is known exactly. However this 
is an idealization that may not be applicable in several real world situations, 
including those mentioned above. Although in some situations it may be pos- 
sible to estimate 9q from a reference sample which is known to come from 
the pre-change distribution, this will not always be the case. The performance 
of change detection algorithm s under missp ecification of the pre-change dis- 
tribution has been studied by iBraun ( 19991 ) , who find that this can have a 
significant effect on performance. 

In this paper, we are concerned with the task of monitoring for changes 
in the parameter 9 when there is no prior knowledge available regarding ei- 
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ther its pre- or post- change value. We use a test statistic which has an easily 
computable exact null distribution, which makes it suitable for monitoring se- 
quences which may not have a large number of available observations between 
each change point. Further,, our method is highly computationally efficient and 
is hence suitable for deployment in situations where observations are being re- 
ceived at high frequencies, such as the aforementioned data stream setting. 
We will focus on detecting an increase in the Bernoulli parameter since this 
tends to be more important in practice, although our methods are trivially 
extendable to two-sided change detection. 

Our work is based on the Change Point Model (CPM) framework described 



in lHawkins et al.l (]2003f ). as a tool to extend traditional methods designed for 
fixed-length sequences to to the sequential monitoring setting. The CPM was 
originally introduced to monitor for changes in a Gaussian mea n, but was 
later extended to Gaussian vari ances dHawkins and Zamba . 20051) . and non- 
parametric distributional shifts (jRoss et all 120111 ). We propose to extend this 
work to monitor a Bernoulli proportion. We have made R code implementing 
our method publicly available: details on this can be found in the Software 
Implementation section at the end of this paper. 

The remainder of the paper proceeds as follows. Section [2] begins with the 
problem of testing for a change in a Bernoulli parameter when dealing with a 
fixed length sequence. Section [3] then shows how the technique we develop can 
be extended to sequential monitoring, where observations are being received 
over time. A discussion of implementation issues follows in Section 21 and an 
empirical evaluation of performance is carried out in Section [5J 



2 The Bernoulli Change Point Model: Fixed Length Sequence 

We first consider the problem of detecting a change point in a fixed length 
sequence. In this case there are t Bernoulli observations, Xi, . . . , X t . We stress 
again that neither the pre- or post- change value of 9 t is assumed to be known, 
and we assume for now that the sequence contains at most one change point. 

For a specified point k in this sequence, we can use a standard two-sample 
hypothesis test to assess whether a change point occurs at r = k, with the 
null hypothesis being that there is no change and that all t observations are 
identically distributed. 

Several s uch tests exist in the statistics literature. We choose to use Fisher's 
Exact Test (lAgrestil fl992h (FET) since its null distribution can be computed 



exactly, rather than relying on Gaussian approximations which only hold 
asymptotically. This property is important since we would like our change 
detector to be deployable in situations where only a small number of obser- 
vations are available between change points. Another important property of 
FET is that its null distribution does not depend on the true value of 9 t - 

The idea behind FET is as follows: suppose the observations at time t are 
broken up into two samples x±, . . . , Xk and Xk+i, ■ ■ ■ , xt- Let the null hypoth- 
esis be that there are no change points in the sequence, which implies that 
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both samples have been generated by the same Bernoulli distribution with a 
fixed parameter 9q. Under this assumption, the Xi variables are identically 
distributed, with P(X l = 1) = 6 , and P(X l = 0) = 1 - 9 for all i. Let S t be 
a random variable defined as the number of failures in the first t observations, 
i.e: 

t 

»=i 

Then, conditional on St = St, FET uses a combinatorics argument to reason 
about how the number of observed failures are distributed between the two 
samples. Let Sk be the number of failures in the first sample. Under the null 
hypothesis, the probability that Sk — Sk follows a hypergeometric distribution: 

P(S k = s k \St = s t )= WAa:--^ (1) 

\k) 

where (?) is the binomial coefficient. A fundamental property of the FET 
is that this probability does not depend on the unknown parameter 9q. By 
conditioning on the value of the sufficient statistic St, this dependency has 
been removed. Therefore the p-values of the FET under the null hypothesis 
are independent of 9$ , which makes this test suitable for situations where this 
parameter is not known. Now, as noted in the Introduction, we will generally 
be more interested in detecting an increase in 9 t , which corresponds to an 
unusually small number of failures occuring within the first k observations. 
The probability of there being Sk or less failures in the first k observations 
under the null hypothesis that there is no change point and all observations 
are identically distributed is: 

k 

p k ,t = J2 P ( Sk = s ^> Pk,t€ [0,1]. 

2=1 

This is the one-sided p-value of the FET. For convenience, we will instead 
work with the statistic F^t, defined as: 

F k ,t = l-p k ,u F M e[0,l]. 

Finally, the null hypothesis that no change occurs at k is rejected if Fk,t > hk,t 
for some appropriately chosen threshold hk,t- 

Of course, since we have no prior knowledge of where the change point is 
located, we do not know which value of k to use for testing. The null hypoth- 
esis is now that there is no change point in the data, while the alternative 
hypothesis is that a change point exists at any location. To perform this test, 
we can use a metho d analogous t o the procedure followed in generalized likeli- 
hood ratio testing (|Pettitd . Il980h . We compute Fk t t at every point 1 < k < t, 
and use the maximum value. This leads to the statistic: 
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F t — max Fiit, 1 < k < t. 

k 

If Ft > h t for some suitably chosen threshold h t , then the null hypothesis is 
rejected, and we conclude that a change occured at some point in the data. 
In this case, the best estimate f of the location of the change point is at the 
value of k which maximized F t . If F t < h t , then we do not reject the null 
hypothesis, and hence conclude that no change has occurred. The choice of 
this h t threshold will be discussed further in the following section. 

3 The Bernoulli Change Point Model: Sequential Monitoring 

We now consider the task of change detection when new observations are being 
received in discrete time. Let X t be the t th observation where t 6 1, 2, ... is 
growing and perhaps unbounded. 

For each value of i, we can treat the sequence X\, . . . ,X t as being of a 
fixed length, compute F t , and use the methodology from the previous section 
to test whether a change point has occurred. If a change is detected, we give 
a signal, and stop monitoring. If no change is detected, then we wait until the 
next observation X t +\ is received and repeat this process, computing F t +\ and 
again testing for a change. In other words, we propose computing F t at each 
time point, and signalling that a change has occurred when F t > h t . In the 
case where the sequence may contain multiple change points, the monitoring 
process is then restarted immediately at the observation following the change, 
with all previous observations being discarded. 

Although recomputing F t whenever a new observation is received may seem 
computationally expensive, it can actually be computed in a very efficient 
manner, as will be discussed in Section 2] 

The key issue with this approach then becomes determining the sequence of 
thresholds {ht}. As mentioned in the Introduction, an important requirement 
for sequential monitoring algorithms is that the rate of false alarms can be 
bounded, where a false alarm constitutes flagging that a change has occurred 
when no change has taken place. In other words, assuming that no change has 
occurred, we ideally would have: 

P(F t > h t \F t -x < h t -x, ...,F 1 <h 1 ,9 t = d ) = a, (2) 

where a is some user-specified constant. In this case, the expected time between 
false alarms (again denoted as the ARLq, for Average Run Length) is equal to 
1/a. Because the finite-sample conditional distribution of F t cannot generally 
be computed analytically, the usual approach when working with change point 
models is to use Monte Carlo simulation to compute the sequence of h t values 
corresponding to the desired ARLq. This can be a computationally expensive 
procedure which is not feasible to carry out in real-time as data is being 
received, therefore the usual solution is to compute the ht values corresponding 
to many different choices of the ARLq in advance, and then store these in a 
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looku p table so they can be easily used ( Hawkins and ZambaL 20051 Ross et al 



201 lh . 

However, a problem arises - unlike the examples considered in the existing 
CPM literature, the FET test involves conditioning on the observed number of 
successes, as can be seen from Equation [TJ This implies that the thresholds h t 
used in the CPM should be conditional on the particular data sequence {X t } 
that been observed, with different sequences requiring different thresholds. 
Since a collection of n Bernoulli random variables has 2™ possible realisations, 
it is he nce not possible to use the sort of precomputed lo okup table which is 



used in Hawkins and Zambal <|2005h and lRoss et all (|201lh . 



As a solution to this problem we design the CPM in a conservative fashion. 
In general, pt is smallest (with F t being largest) when 8q — O.5.. Therefore, in 
order to generate a threshold sequence ht that will an give an ARLq of at least 
1/a, we simulate Bernoulli sequences under the assumption that 8q = 0.5 (full 
details of this simulation follow below). Because other values of 6q result in 
lower values of F t , this will result in an ARLq which is greater than 1/a, i.e. 
we will have the more conservative criteria: 

P{F t > ht\F t -i < ht-i, ...,F 1 <hi,0 t = O o )<a. (3) 

One final complication which can arise is the discreteness of the test statis- 
tic Fk ■ Because the sequence of X t random variables has a Bernoulli distribu- 
tion, there is only a finite number of values which Fk can achieve. This can 
potentially add a further degree of conservativeness to the procedure. 

I n order to reduce t he discreteness of the test statistic, we borrow an idea 
from Zhou et al.l (|2009h and recommend smoothing the Fk values. Specifically, 



we define a new statistic Y t which is formed by applying exponential smoothing 
to the Fk,t statistics, 



Yi, t = F ltt 

Y k ,t = (1 - A)n_i, t + XF k ,t 
Y t = max kY k>u A G [0, 1]. (4) 

The underlying idea is that if a change point occurs at r, then the values 
of Fk <t should be high whenever k is close to r. Therefore, smoothing the Fk,t 
statistics in this way should not negatively impact performance. Further, by 
using the smoothing, the range of possible values for each Yk t is significantly 
increased compared to the Fkt statistics, hence Y t has more possible values 
than F t . This allows a sequence of thresholds to be chosen which give an ARL 
closer to the desired value. We are now faced with the issue of choosing A; 
generally, a value close to results in more smoothing and produces a change 
detector which is slightly better at detecting smaller shifts in t , while a higher 
value results in more smoothing and is hence more suitable for detecting larger 
shifts. In our Experiments section we will explore several different choices of 
A and show that the performance difference is not overly large. 
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00 


Empirical ARLo 




A = 0.1 


A = 0.3 


0.01 


971 (862) 


1039 (765) 


0.05 


622 (543) 


638 (597) 


0.10 


589 (489) 


529 (530) 


0.20 


512 (461) 


516 (491) 


0.30 


509 (460) 


509 (474) 


0.40 


504 (471) 


507 (483) 


0.50 


500 (493) 


500 (494) 



Table 1: Empirical ARLq when the CPM is designed with to have an ARLq 
of 500, for several values of 6q. Standard deviations are shown in brackets. 



We now return to the original problem of finding a sequence of thresholds 
which satisfies Equation [3] This is non-trivial; the marginal distribution of 
the F t and Y t statistics is complex, and the conditional distribution in Equa- 
tion [5] moreso. Since it does not seem possible to determine the thresholds 
analytically, we instead compute them using a Monte Carlo technique . In 
order to compute the thresholds for some choice of a, we simulate one million 
streams each containing 2000 Bernoulli(0.5) observations. For each stream, we 
compute Y t at each observation. The required values of ht can then be found 
successively, by starting with the first observation and choosing hi to make the 
proportion of Y\ values exceeding hi equal to a. The streams which exceed 
a are then discarded, and h 2 can then be chosen so that the proportion of 
remaining Y2 values exceeding h 2 is equal to a, and so on. In this way, the 
conditional distributions are successively approximated, allowing the threshold 
sequence to be found. 

Although this simulation is computationally expensive and may require 
several hours of processing, this is not a problem since it only needs to be 
performed a single time, and can hence be performed in advance. Then, once 
we have computed these values, we can store them in a lookup table which can 
be accessed with no computational overhead. We present such a lookup table in 
Table [5] in the Appendix, which gives the threshold sequences corresponding 
to several choices of a. It can be seen that these thresholds seem to slowly 
converge towards constant values, so it seems reasonable to use the value of 
as an approximation of h t for t > 2000. 

Since we have computed these thresholds under the assumption that 9o = 
0.5, the CPM will be conservative when the parameter of the Bernoulli se- 
quence is not equal to this value, causing the ARL to be higher than desired. 
In order to investigate just how conservative this procedure is, Table [1] shows 
the actual empirical ARLo values that are achieved by a CPM designed to 
have ARLo — 500, when 0q takes on a variety of other values. It can be seen 
that the degree of conservativeness is quite small, unless po takes on a very 
small value (<0.01). The CPM hence seems suitable for use on most Bernoulli 
streams, which we will investigate further in Section [3] 
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4 Implementation Issues 

Having completed the description of the Bernoulli CPM, we now turn towards 
its implementation. In many important real world scenarios, computational 
resources are limited so it is important to have a change detection algorithm 
which can be computed efficiently. For the CPM, the majority of computation 
time is spent calculating the Fkj statistics. From Equation [T] we can see that 
this is equivalent to evaluating the probability mass function of the hyperge- 
ometric distribution. Most common statistical packages will provide a highly 
optimized routine for this. However, we can increase computational efficiency 
by exploiting the high level of correlation between the Fk t t statistics. 

Consider a fixed length sequence containing t observations, of which St are 
failures. Let Sk be the number of failures observed before observation k. Write: 



7 (**)(*- 

'■k. I — 



t-s t \ 



{ N\ 



Now, rather than computing dk+i.t from scratch, we can compute it re- 
cursively from dk,t- We make use of the following identities for the binomial 
coefficient: 



n \ n 



— k (n + 1\ fn\ n + 1 



K k + lJ \kj k + 1 V k J V k 
From this, algebraic manipulation shows that: 



jfc,t(st + l)(t-S f +l)(t+l-S f ) 



if X t+ i = 1 



, _ I (k + l-S k )(t-S t -k-S k +l)(t + l) ^t+L X /r\ 

"M+l - \ d ktt (t-s t + l)(t+l-s t ) . r ^ „ W 



(t_ St _ fc _ Sfc+ i)( t+ i) tiXt+i 



And similarly: 



{dk,t(st-s k )(k+l) ;f v _ 1 

(s k + l)(t-k) H A k+1 - 1 
— {k _ Sk+m _ k) ll X k+1 = U 

Using these recursive formulations significantly decreases the processing time 
required to compute each value of F^t- Recall that Fk : t is defined as: 



k 

*k,t ' ^_ 
i=l 

Therefore for all k < t, the value of dk,t can be calculated from dk : t-i using 
Equation O without the need to evaluate any factorials. When i = t this does 
not apply, since dtj-i is not defined. In this case, dk,t can be computed from 
dk-i,t using Equation[6] 

However even though this recursive formulation allows efficient computa- 
tion of F t , and hence the smoothed Y t statistics, the time required to com- 
pute these values still grows linearly over time as more observations are re- 
ceived. Further, it also requires all previous data points to be retained in mem- 
ory. In situations where this is not feasible due to constraints on processing 
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power/memory, we can use a further efficiency device where only the previous 
w observations are stored in memory, with the previous t — w discarded. 

We define a window W w .t of length w to be the set of the w most re- 
cently observed points, i.e. W Wt t = {xt- w +i, ■ ■ ■ ,%t}- Only the points in this 
window are stored in memory, with older points discarded. The F t statistic is 
now calculated by maximising F^t over only the observations in the window, 
rather than over the whole stream. Therefore, the memory only needs to be 
large enough to contain the previous w points, and the computational cost of 
computing F t is constant rather than growing over time. 

A naive implementation of windowing where old points are discarded would 
have a significant negative effect on the performance of the change detector. 
Therefore, we do not discard old points entirely, but instead summarise them 
in the sufficient statistic St- W which maintains the sum of the observations too 
old to be included in the window. 

In other words, suppose that at time t, there have been St_ lu failures outside 
the current window. We define: 

k 

Sk = st-iu + Xi, t - w < k < t. 

i—t — w+l 

The Fk.t statistics are then defined for t — w < k < t, and maximisa- 
tion is carried out only in this range. Because Sk,t is a sufficient statistic for 
X%, . . . , X w , no information is lost by discarding these points, and values of 
F t at each point in the window is identical to the value when no windowing is 
used, meaning that no information is lost, so no loss in performance will occur. 
Also because older points are summarised in St- W , the choice of the window 
size w is not critical since it only determines the points at which a change may 
be detected. 



5 Performance Analysis 

Having introduced the Bernoulli CPM, we now proceed to evaluat e its perfor- 

mance. We compare it to the Bernoulli CUSUM chart introduced in Reynolds and Stoumbosl 



(If 9991) . Jnlike the CPM, the CUSUM requires both the pre- and post- change 



values of 9 t to be known, and is the optimal change detector und er this assump 
tion, in the sense of minimising the worst case detection delay ( Lordenl . ll97ll) 



Since our CPM assumes nothing about 6?t, we would expect its performance 
to be inferior, so we consider the CUSUM as a benchmark which defines op- 
timal performance. Because designing an optimal CUSUM is unrealistic since 
generally the values of 8q and Q\ will not be known, we also investigate the 
effect that misspecification of these parameters has on the CUS UM. 



Following standard practice in the change detection literature (jBasseville and Nikiforov 



f993(), we evaluate performance by setting the ARLq of both methods to be 
equal, and then compare the expected delay taken to detect changes of various 
magnitudes. In order to this we need to choose some value for the ARLq, so 
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we choose ARLq = 500 for both methods. Our results apply without loss of 
generality to other choices of the ARLq. 

Since our CPM assumes no prior knowledge of the pre-change parameter 
9q and instead require it to be learned from the data, the location of the 
change point will affect their performance. A change which occurs early will 
be more difficult to detect than one which occurs late, since fewer observations 
are available to allow 9q to be estimated accurately. We therefore consider two 
different change point locations, t = {50,300}, which correspond to early 
and moderately located changes. In the case of sequences containing multiple 
change points, this corresponds to how far apart the change points are located. 

We now briefly review the CUSUM before proceeding with the performance 
comparison. 



5.1 CUSUM 



Our i mplementation of the CUSUM chart follows Reynolds and Stoumbosl 



(1999). Given a sequence of Bernoulli observations Xi,X2, ■ ■ ■ with known pa- 



rameter 9q before the change point and 9\ after, we define: 

C = 

C t = max(0, C t -i + x t — k), 
where A: is a reference value defined as k — ri/r-2, where: 

ri = - log — , r 2 = log 



1-V °\9 n {l-9 1 ) / 

A change is flagged when Ct > h(9o, 9\) for some appropriately chosen control 
limit h(9o, 9\), which is chosen in order to give the CUSUM a specified ARLq. 
This can be done using ( for ex ample) the approximation scheme discussed in 
Reynolds and Stoumbosl ( 19991 ). or by Monte-Carlo simulation. We note that 



due to the discreteness of the test statistic, it will generally not be possible to 
achieve a target ARLq exactly. Although we previously discussed this problem 
in the context of our CPM, it is more serious for the CUSUM since the Ct 
statistic is more discrete. For each choice of 9o and 6-y, we computed control 
limits which gave the CUSUM the minimum possible ARLq greater than 500. 
In practice however, the achieved ARLqS were in the range 500-600, but for 
cases where the test statistic is especially discrete, most notably when 9$ = 0.1 
and 9\ = 0.9, it is not possible to find any threshold greater between 400 and 
800. 



5.2 Results and Discussion 



We wish to compare the expected delay taken by both the CPM and the 
CUSUM to detect changes of various magnitues in a Bernoulli parameter. 
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Table 2: Expected delay taken to detect a change from 9q to 9\ when the change 
occurs at point r = 300. Standard deviations are provided in the Appendix. 
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7.3 






0.4 


0.00 


0.00 


0.00 


252.8 


96.8 
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14.8 
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Table 3: Expected delay taken to detect a change from 9q to 9\ when the change 
occurs at point r = 50. Standard deviations are provided in the Appendix. 



We consider cases where the pre-change value of the Bernoulli parameter f?o 
takes values from the set {0.1,0.2,0.3,0.4}, and the post-change value is 9\ S 
{f? + 0.1, 9 + 0.2, . . . , 0.9}. By the symmetry of the Bernoulli distribution, 
the results for 9o = 7 will be identical to those for #0 = 1 — 7, so we do not 
consider cases where 9q > 0.5. 

For each choice of the parameters 9q and 9\, and the change time r, we 
generated 20000 sequences of Bernoulli^) observations where 9 t = 9q when 
t < t, and 9 t = 9\ when t > r. We then configured the CUSUM to be optimal 
for these parameter values, by choosing k as described in Sect ion I5TT1 The CPM 
of course does not have any knowledge of these parameter values. For each 
sequence, we then ran both the CPM and the CUSUM, and found the time T 
at which a change was flagged. The expected delay is then E[T\T > r]. Recall 
that the CPM uses a parameter A which controls the degree of smoothing 
applied to the Fk,t statistics; we ran the CPM using both A = 0.1 and A = 0.3 
to investigate the effect this has on performance. 
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01 


Detector 


00 


0.2 0.3 0.4 0.5 0.6 0.7 0.8 


Misspecified CUSUM 


0.1 
0.2 
0.3 
0.4 


497.0 70.5 30.1 16.6 11.0 8.0 6.0 
0.00 484.3 80.2 33.0 20.0 13.2 8.7 
0.00 0.00 487.2 87.2 35.5 19.7 13.3 
0.00 0.00 0.00 482.4 89.0 36.3 19.8 



Table 4: Average delay required to detect a change from 0q to B\ when the 
change occurs at point r = 50. Standard deviations are provided in the Ap- 
pendix. 



The expected delays are presented in Table [5] for the case where t = 300, 
and Table [3] for r = 50. Several aspects of these results deserve comment: 

— As expected, the CPM performs substantially better when the change oc- 
curs after 300 observations, than when it occurs after only 50. This will 
be the case with any change detection algorithm which does not have full 
knowledge of the pre-change distribution and must instead learn it online. 
When only 50 observations are available, it is difficult to form an accurate 
estimate of 9q and performance suffers. 

— The CPM does not seem overly sensitive to the choice of A used for smooth- 
ing. When A = 0.1, the CPM is slightly better at detecting small changes, 
while being worse at detecting larger ones, since the increased smoothing 
causes the large jump in the test statistic to be partially flattened out. As 
in the case of EWMA charts, the 'best' choice of A will depend on the mag- 
nitude of change for which it is most important to minimize the detection 
delay. 

— The performance of the CPM is very close to the CUSUM when the change 
occurs after 300 observations, demonstrating its usefulness. However, it is 
markedly worse when the change occurs after only 50 observations. It is 
important not to read too much into the relatively poor performance of 
the CPM in the latter case, since the CUSUM is of course designed under 
full knowledge of the Bernoulli parameters. It is hence only a measure of 
the best possible performance, when everything is known exactly. We will 
explore this further in the following section. 



5.3 Effect of Parameter Misspecification on the CUSUM 

The CUSUM used above was designed based on complete knowledge of the true 
values of 9q and 9\ , and under these assumptions it is the optim al sequential 
change detector, in the minimax sense described bv lLorde"nl (1971). However in 



practice this optimality is of dubious value, since it is rare that these param- 
eters will be known exactly, and practical deployments of the CUSUM must 
take into account the possibility of parameter misspecification. 

In order to get a more realistic picture of how the CPM compares to the 
CUSUM, we now investigate the effect that such parameter misspecification 



14 



Gordon J. Ross et al. 



has on the performance of the CUSUM. This is most likely to happen when 
the change occurs early during the monitoring, since there may be insufficient 
observations to allow 9q to be accurately estimated. We therefore consider the 
case where the change occurs at observation r = 50. Suppose that instead of 
being designed based on the true value of 9q, the CUSUM is instead designed 
based on 9 = 6*0 + 0.1, which corresponds to a small degree of misspecification 
of the pre-change parameter. In this case, the CUSUM will be using a slightly 
higher value for the control limit h than is optimal, and there should hence 
be a decrease in performance. Similarly, we assume that the post-change value 
of the parameter is also misspecified, and that the CUSUM is designed under 
the assumption that 9± = 6± + 0.1. So for example, in a situation where the 
parameter changes from a value of 0.3 to 0.6, the CUSUM is designed under 
the false assumption that it is changing from 0.4 to 0.7. This is a relatively 
small degree of misspecification which could easily occur in practice. 

The expected delay was computed in the same way as in the previous 
section, and the results are shown in Table |3J Comparing this to Table [3] in 
the previous section shows that the performance of the misspecified CUSUM 
is substantially worse than the CPM across every value of 9q and 9\ . The CPM 
therefore seems to be a more appropriate change detector in situations where 
9q may not be estimated accurately, and where the exact magnitude of the 
change is unknown. 

6 Conclusions 

We developed a change point model for Bernoulli sequences where new obser- 
vations are being received over time, and the pre-change value of the parameter 
is unknown. Several computational devices were introduced which allows the 
relevant test statistics to be computed recursively in a very efficient manner. 
Our method has very favorable performance compared to the optimal CUSUM 
chart. When the change occurs relatively late in the stream, the CPM gives 
comparable performance to the CUSUM chart which had full knowledge of 
the pre- and post- change Bernoulli parameters. But when the change occurs 
early in the stream, the CPM method performs poorly compared to CUSUM. 
However, in many data stream settings there will not be full knowledge of 
these parameters, and hence this optimal CUSUM is not very realistic. When 
we investigated how the CUSUM performs under small degrees of parameter 
misspecification, it's performance is seen to decline drastically, and the CPM 
seems the best tool in this unknown parameter situation. 



7 Software Implementation 

An implementation of the CPM methodology described in this paper can be 

found in the cpm R package, which is available either from CRAM (http://cran.r-project.org), 

or from the first author's website (http://gordonjross.co.uk). 
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A Additional Tables 
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0.9902 
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0.9523 
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0.9911 


28 





8964 





9051 


0.9259 


0.9522 





9645 





9706 


0.9809 


0.9912 


29 





8958 





9071 


0.9260 


0.9538 





9650 





9706 


0.9812 


0.9920 
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8966 





9057 


0.9268 


0.9549 
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0.9817 


0.9931 
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9057 





9179 


0.9392 


0.9643 





9712 





9771 


0.9857 


0.9956 


50 





9199 





9317 


0.9509 


0.9742 





9759 





9809 


0.9886 


0.9966 


60 





9303 





9411 


0.9597 


0.9817 





9777 





9826 


0.9904 


0.9976 


70 





9381 





9489 


0.9657 


0.9859 





9794 





9842 


0.9918 


0.9979 


80 





9430 





9536 


0.9698 


0.9888 





9807 





9854 


0.9923 


0.9983 


90 





9470 





9575 


0.9738 


0.9904 





9812 





9860 


0.9929 


0.9984 


100 





9486 





9591 


0.9758 


0.9918 





9821 





9867 


0.9934 


0.9985 


200 





9599 





9696 


0.9840 


0.9962 





9844 





9892 


0.9945 


0.9990 


300 





9631 





9728 


0.9860 


0.9971 





9848 





9891 


0.9950 


0.9992 
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9637 





9731 


0.9868 


0.9974 





9852 





9888 


0.9952 


0.9992 
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9652 





9735 


0.9876 


0.9976 
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9897 


0.9953 


0.9992 


600 





9654 





9743 


0.9873 


0.9977 





9847 





9889 


0.9954 


0.9994 


700 





9639 





9747 


0.9876 


0.9978 





9856 





9896 


0.9954 


0.9993 


800 





9668 





9757 


0.9881 


0.9979 





9858 





9896 


0.9953 


0.9993 


900 





9669 





9761 


0.9885 


0.9981 





9859 





9897 


0.9953 


0.9993 


1000 





9671 





9763 


0.9811 


0.9982 





9860 





9897 


0.9954 


0.9994 


2000 





9679 





9767 


0.9892 


0.9984 





9861 





9899 


0.9955 


0.9994 



Table 5: Value of the threshold h t achieving various values of the ARLq for 
the Bernoulli CPM 
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Table 6: Standard deviations accompanying Table 1, for changes from 9o to 6\ 
occurring at location r = 300. 
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Table 7: Standard deviations accompanying Table 2, for changes from 9o to 9\ 
occurring at location r = 50. 
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Table 8: Standard deviations accompanying Table 3, for changes from 9 to 9\ 
occurring at location r = 50. 



